Toward Recovery-Oriented Computing
Abstract
Recovery-Oriented Computing (ROC) is a joint research effort between Stanford University and the University of California, Berkeley. ROC takes the perspective that hardware faults, software bugs, and operator errors are facts to be coped with, not problems to be solved. This perspective is supported both by historical evidence and by recent studies on the main sources of outages in production systems. By concentrating on reducing Mean Time to Repair (MTTR) rather than increasing Mean Time to Failure (MTTF), ROC reduces recovery time and thus offers higher availability. We describe the principles and philosophy behind the joint Stanford/Berkeley ROC effort and outline some of its research areas and current projects.

1 The Case for ROC and "Peres's Law"

If a problem has no solution, it may not be a problem but a fact, not to be solved but to be coped with over time. (Shimon Peres)

Despite marketing campaigns promising 99.999% availability, well-managed servers today achieve 99.9% to 99% availability, or 8 to 80 hours of downtime per year. Each hour can be costly, from $200,000 per hour for an Internet service like Amazon to $6,000,000 per hour for a stock brokerage firm [Kembel00]. Total cost of ownership ranges from 3 to 18 times the purchase cost of many cluster-based systems, and a third to a half of that money is spent recovering from or preparing for failures [Gillen02]. Despite decades of research that have achieved four orders of magnitude of improvement in performance, large cluster-based systems and end-user terminals alike still fail, and we have not made sufficient headway in curbing failures to keep up with the increasing complexity of our systems and our dependence on them. We conclude that such failures are a fact of life: not a problem that will someday be solved once and for all, but a reality that we must live with. We propose to cope with this reality through fast and graceful recovery.

The quantitative rationale for the ROC approach may be summarized as follows. A widely accepted equation for system availability is A = MTTF/(MTTF + MTTR), where MTTF is the mean time to system failure and MTTR the mean time to recovery after a failure. The target is to approach A = 1.0, and much historical effort has focused on achieving this by pushing MTTF towards infinity: making hardware ever more reliable, investing more resources in software design and testing, employing redundancy to allow continuous operation in the presence of partial failures, and so on. We argue that an alternate way to approach A = 1.0 is to focus on making MTTR << MTTF. To some extent this has been embraced in communities such as hardware design and database design; in fact, in those systems rapid recovery is often used within a particular layer of functionality to prevent a visible failure in a higher layer, perhaps by making it visible as only a performance blip, so that a sufficiently reduced MTTR in one layer manifests as increased MTTF in higher layers. What is new in ROC is that we believe it is time to emphasize reducing MTTR at the highest layers, the application and its end users, as a way of improving availability, and that numerous if currently anecdotal successes from the Internet systems community can help illuminate the way to do this.

1 Thanks to Lisa Spainhower for this elegant generalization of the observation.
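To see why shrinking MTTR can be as effective as stretching MTTF, it helps to rewrite the availability equation; the numbers below are illustrative only and are not measurements of any particular system:

\[
A \;=\; \frac{\mathrm{MTTF}}{\mathrm{MTTF}+\mathrm{MTTR}} \;=\; \frac{1}{1+\mathrm{MTTR}/\mathrm{MTTF}},
\]

so availability depends only on the ratio MTTR/MTTF. For example, MTTF = 1000 hours and MTTR = 1 hour give

\[
A = \frac{1000}{1001} \approx 0.999,
\]

or roughly 8.8 hours of downtime per year, and cutting MTTR tenfold to 0.1 hours improves availability exactly as much as raising MTTF tenfold to 10,000 hours, since both changes leave the ratio MTTR/MTTF at 1/10,000.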
1.1 Why Focus On Recovery

There are several reasons we have chosen to focus squarely on recovery:

Human error is inevitable. Over 50% of outage incidents, and a comparable fraction of outage minutes, are due to operator error. We conducted two surveys that confirmed this, both in the public switched telephone network [Enriquez02] and in a selection of representative large-scale Internet cluster services [Oppenheimer02].

MTTR can be directly measured. Today's disks have quoted MTTFs of 120 years. Verifying such claims requires many system-years of operation, which is beyond the reach of all but the largest customers. In contrast, the longest MTTRs for commercial database products are on the order of days, and many are on the order of hours, making MTTR claims verifiable.

Lowering application-level MTTR can directly improve the user experience. In April 2002, eBay had a 280-minute sustained outage affecting most of its services. This and similar previous outages are highly visible, newsworthy, and affect customer loyalty and investor confidence [Dembeck99]. In contrast, had eBay suffered one 6-minute outage per week, it would have achieved roughly the same availability (according to the formula), but the individual outages would probably not be newsworthy because each affects far fewer users (see the short calculation after this list). The difference, of course, is that in the latter case the same availability is achieved by having a much shorter MTTR.

Frequent "recovery" may lengthen effective MTTF. Software rejuvenation [Garg97] and recursive restartability [Candea01] both exploit the observation that by periodically returning a system to its start state (typically a well-understood and heavily tested state), we can reclaim stale resources, clean up corrupted state and other accumulated effects of software aging, and eliminate the corresponding symptoms (e.g., performance degradation due to memory leaks), and that we can do these things by relying on well-tested but limited-functionality hardware support such as the virtual memory system.
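A back-of-the-envelope check of the eBay comparison above, assuming purely for illustration that the 280-minute outage is the only downtime in a 365-day year (525,600 minutes) and that the alternative is exactly one 6-minute outage every week:

\[
\frac{280}{525{,}600} \approx 0.053\% \text{ downtime} \;\Rightarrow\; A \approx 99.95\%,
\qquad
\frac{52 \times 6}{525{,}600} \approx 0.059\% \text{ downtime} \;\Rightarrow\; A \approx 99.94\%.
\]

The yearly availability is nearly the same in both cases; what differs is the MTTR, 280 minutes versus 6 minutes, and with it the visibility of each individual outage.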
1.2 The ROC Research Agenda

Anecdotally, we know that some systems support (or at least tolerate) some of the above scenarios well; for example, "rolling reboots" are standard procedure for most cluster-based services [Brewer01]. The ROC research questions may therefore be stated as follows:

- For the recovery scenarios above, what does it mean for the corresponding system or subsystem to be designed for recovery?
- How can we identify and classify faults so that we can select the most effective strategy when recovery is needed?
- How will we measure our success?

We now describe a sampling of ROC work in progress that addresses each of these questions.

2 Some Research Areas and Projects

Our initial targets are large Internet-scale and corporate-scale applications: email, portals, wide-area storage, and so on. We chose these in part because they have interesting state management requirements but often do not need transactional guarantees, allowing us to ask whether recovery can be improved by trading some of those guarantees away. Also, the very large scale of these systems brings some tradeoffs into sharp relief: search engines feature a single specialized application running on thousands of nodes, and some guarantees such as consistency must be relaxed in order to meet throughput and latency requirements and provide for incremental scaling [Brewer01]. Finally, these services are trying to build mission-critical functionality from semi-reliable COTS parts in the face of high feature churn; we believe these constraints and the attendant market pressures are indicative of many future mission-critical systems, so our recovery strategies should address these cases.

Measurement and Benchmarking. We are building on our and others' earlier work on availability benchmarking [Brown00, Lambright00] to come up with metrics that capture more than just "up or down" availability, such as graceful performance degradation during recovery. Similarly, we are considering the end-user-visible effects of different forms of unavailability [Merzbacher02] and ways to incorporate human operator behavior in dependability benchmarks [Brown02b]. Given our stated focus on recovery, we are gratified to see initial industrial support for benchmarking recovery from various kinds of failures as well [Zhu02].

Recursive Restartability. A recursively restartable (RR) system [Candea01] gracefully tolerates successive partial restarts at multiple levels, which can be used to recover from transient failures more quickly than a full reboot would. To apply RR to a system, we construct a restart tree that captures restart dependencies among components: restart-tree nodes are highly fault-isolated, and a restart at any node restarts the entire subtree rooted at that node. To enforce the containment boundaries and the subtree restart behavior, we rely on hardware-level support such as virtual memory, process groups, and physical node boundaries. A policy oracle decides which subtree to restart in response to a particular detected failure; a simple arithmetic model quantifies the cost of the oracle making a mistake. We have successfully applied RR to an amateur satellite ground station controller [CCF+02], reducing its time to recover by a factor of 3 to 4, and we are currently investigating "design for restartability" for stateful components that are required to provide bounded or probabilistic data durability and integrity.
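The following is a minimal sketch, in C, of the restart-tree idea described above. It is not the RR implementation: the node layout, the stop/start callbacks, and the stop-top-down/start-bottom-up ordering are assumptions made for illustration, and in the real system containment is enforced by mechanisms such as virtual memory, process groups, and physical nodes rather than by in-process function calls.

```c
/* Minimal sketch of a restart tree (illustrative only, not the RR code).
 * Each node wraps a fault-isolated component with hypothetical
 * stop/start callbacks; restarting a node restarts its whole subtree. */
#include <stdio.h>

#define MAX_CHILDREN 8

struct rnode {
    const char   *name;
    void        (*stop)(struct rnode *);   /* tear the component down  */
    void        (*start)(struct rnode *);  /* bring the component back */
    struct rnode *children[MAX_CHILDREN];
    int           nchildren;
};

/* Stop parents before children, then start children before parents,
 * so lower-level dependencies are back up before their dependents.
 * (This ordering is an assumption of the sketch.) */
static void stop_subtree(struct rnode *n)
{
    n->stop(n);
    for (int i = 0; i < n->nchildren; i++)
        stop_subtree(n->children[i]);
}

static void start_subtree(struct rnode *n)
{
    for (int i = 0; i < n->nchildren; i++)
        start_subtree(n->children[i]);
    n->start(n);
}

/* Entry point a policy oracle would call after choosing the smallest
 * subtree whose restart is expected to cure the observed failure. */
void restart(struct rnode *n)
{
    printf("restarting subtree rooted at %s\n", n->name);
    stop_subtree(n);
    start_subtree(n);
}

/* Tiny demo: a two-level tree with no-op components. */
static void noop_stop(struct rnode *n)  { printf("  stop  %s\n", n->name); }
static void noop_start(struct rnode *n) { printf("  start %s\n", n->name); }

int main(void)
{
    struct rnode worker = { "worker", noop_stop, noop_start, {0}, 0 };
    struct rnode server = { "server", noop_stop, noop_start, { &worker }, 1 };
    restart(&server);   /* restarts server and, transitively, worker */
    return 0;
}
```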
System-level Undo for Operators. We are "wrapping" an off-the-shelf IMAP mail server with system-level undo functionality, to recover from (for example) administrative errors that would otherwise cause data loss or an unacceptable user-perceived data inconsistency [Brown02]. A system with undo also provides a forgiving environment that promotes ingenuity and exploration: the system operator can try innovative solutions to problems without the fear of permanent disastrous consequences, and an operator-in-training can safely learn by making mistakes and recovering from them. Psychology has shown that this kind of trial-and-error learning is one of the most effective methods of human learning [Reason90], yet only with an undoable system is the cost of mistakes low enough to make it feasible.

Fault injection. FIG (Fault Injection in glibc) is a lightweight, low-overhead, extensible tool for triggering and logging errors at the application/system boundary. FIG uses the LD_PRELOAD environment variable to interpose itself between the application and glibc, the GNU C library, causing selected libc calls to fail in order to simulate a failure in the operating environment. Using FIG to trigger such faults in a variety of applications, from desktop programs to transaction servers, we have been able to start classifying the successful recovery techniques that appear in the applications that fare best under fault injection [BST02].
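To illustrate the interposition mechanism FIG relies on, here is a from-scratch sketch of an LD_PRELOAD library; it is not FIG itself, and the choice of write(), the fixed 1% failure rate, and the file name are arbitrary assumptions for the example.

```c
/* fail_write.c: illustration of LD_PRELOAD interposition (not the FIG tool).
 * Build and use, e.g.:
 *   gcc -shared -fPIC -o fail_write.so fail_write.c -ldl
 *   LD_PRELOAD=./fail_write.so ./some_application
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Interpose on write(): forward most calls to the real libc write(),
 * but make roughly 1% of writes to non-standard file descriptors
 * fail with EIO, simulating a flaky operating environment. */
ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t);

    if (!real_write)
        real_write = (ssize_t (*)(int, const void *, size_t))
                         dlsym(RTLD_NEXT, "write");

    if (fd > 2 && rand() % 100 == 0) {  /* spare stdin/stdout/stderr */
        errno = EIO;
        return -1;
    }
    return real_write(fd, buf, count);
}
```

The actual tool, as described above, is extensible and logs the errors it triggers; the sketch hard-codes a single call and a fixed failure rate purely for brevity.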
Failure detection and diagnosis. Pinpoint [CKF+02] is a framework for root-cause analysis in large distributed component applications such as e-commerce systems. Pinpoint tags a subset of client requests as they travel through the system, uses traffic sniffing and middleware instrumentation to detect failed requests, and then applies data mining techniques offline to correlate the failed and successful requests and determine which component(s) were likely at fault for the failures. Because it is implemented on the Java 2 Enterprise Edition (J2EE) application server itself, existing J2EE applications can use Pinpoint unmodified; experiments show that it identifies faulty components with high accuracy and a low false-positive rate.

3 Inspiration From Prior Work

Outside of computer science and engineering, we are scanning the literature on disaster handling in emergency systems such as nuclear reactors [Perrow90], human error and "automation irony" [Reason90], and civil engineering failures [Petroski92]. Within computer science and engineering, we look to three large research communities for inspiration and ideas: hardware-level and mission-critical-system fault tolerance, commercial transaction systems, and the Internet systems community.

Hardware and system fault tolerance, whether at the component level or the instruction-set-architecture level, serves the important function of assuring that the underlying system behaves according to a particular well-defined specification; for example, instruction-level retry in the IBM G5 and other mainframes assures that the hardware behaves according to the ISA specification. Purpose-designed safety-critical systems such as the Space Shuttle software have an excellent reliability record as a full system, but at great cost in both maintenance effort and difficulty of making changes. Although we are most interested in exploring recovery at higher layers and with higher-churn components, we expect to be able to apply some of the ideas from this community in analogous ways.

The Internet and systems communities have already begun investigating ways to systematize tradeoffs such as availability vs. consistency [Yu00] and how to substitute soft state for hard state in many kinds of applications [Raman99]. Both approaches have the potential to simplify recovery, and although both require explicit support at application design time, application-transparent recovery has been shown to be inapplicable in a broad range of common failure cases [Lowell00].

The database community has deeply explored failure recovery techniques that provide the strongest guarantees for data integrity; one may say without exaggeration that the sophisticated engineering in such systems today sets the standard for data integrity guarantees after recovery. It has been observed that not all applications require the guarantees such systems provide, and that, especially at extreme scale, it may be beneficial or necessary to trade some of those guarantees away for improved availability [Fox99]; we ask whether they might instead be traded for faster recovery.

In summary, we expect our work to be complementary to the large body of existing work in fault tolerance, and we hope to gain insights and ideas from real collaborations with those communities.